def total_time(elapsed):
    """Print an elapsed time in seconds, rounded to two decimals."""
    seconds = round(float(elapsed), 2)
    print(f"Total Time: {seconds} Sec")


def tanimoto_matrix(fp_list):
    """
    Calculate the Tanimoto distance matrix as a flattened lower triangle.
    """
    matrix = []
    for i in range(1, len(fp_list)):
        # Compare the current fingerprint against all the previous ones in the list
        similarities = DataStructs.BulkTanimotoSimilarity(fp_list[i], fp_list[:i])
        # Since we need a distance matrix, calculate 1-x for every element in similarity matrix
        matrix.extend([1 - x for x in similarities])
    return matrix


def cluster_fingerprints(fingerprints, cutoff=0.2):
    """
    Cluster fingerprints.
    :param fingerprints:
        list of molecular fingerprints.
    :param cutoff:
        distance threshold used to form clusters.
    :return:
        clusters as tuples of fingerprint indices, largest cluster first.
    """
    # matrix
    distance_matrix = tanimoto_matrix(fingerprints)
    # cluster
    clusters = Butina.ClusterData(distance_matrix, len(fingerprints), cutoff, isDistData=True)
    clusters = sorted(clusters, key=len, reverse=True)
    return clusters

Testing BitBIRCH-Lean
Where I Test Different Clustering Methods
Introduction
As the number of known small molecules increases, it becomes difficult to analyze each one. Clustering is an important technique that simplifies this, allowing us to group similar structures and sample a representative structure from each group instead. A typical scenario would be trying to select diverse structures for further testing. It makes sense to sample a couple of molecules from each cluster instead of testing the entire cluster of structurally similar compounds. The issue with clustering occurs when our chemical libraries increase in size. With commercial libraries hitting tens of billions of molecules, clustering with common algorithms can take a long time.
Enter BitBIRCH (Published 2025 in Digital Discovery). Researchers claim that BitBIRCH is > 1,000 times faster than Butina clustering for libraries with 1,500,000 molecules. They also show BitBIRCH taking 5 hours to cluster 1 billion molecules. Not too shabby!
I’ve heard about BitBIRCH online, but it is only in the new year that I was able to run some tests. Here is a good one, where I clustered the chembl-33 natural products subset using BitBIRCH and Butina.
TL;DR
I tested BitBIRCH and Butina clustering on a set of 1000 molecules. The timings on my machine (M2 Pro) are:
- BitBIRCH: 0.02 Seconds
- Butina: 0.11 Seconds
Import Statements
The authors of the BitBIRCH paper provide a nice repository for their package, bblean. It comes with quite a few quality-of-life functions, but I will mainly be focusing on the clustering aspect. A full breakdown of bblean can be found in their documentation here. And of course, it is pip installable:
pip install bblean
::: {.cell ExecuteTime='{"end_time":"2026-01-28T10:07:34.308046Z","start_time":"2026-01-28T10:07:33.973592Z"}' execution_count=1}
``` {.python .cell-code}
import bblean
from rdkit import DataStructs, Chem
from rdkit.ML.Cluster import Butina
from rdkit.Chem import rdFingerprintGenerator
import time
```
:::
Helper Functions
First, I will create some helper functions. One prints how long a step took, in seconds. The others compute the Tanimoto distance matrix and run the Butina clustering.
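As a side note, the start/stop timing pattern used throughout this post can also be wrapped in a context manager. This is only a sketch of an alternative, not what the notebook uses: the name `timed` is hypothetical, and it relies on `time.perf_counter`, which is generally better suited to measuring short intervals than `time.time`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label="Total Time"):
    """Time the enclosed block and print the elapsed seconds (hypothetical helper)."""
    result = {}
    start = time.perf_counter()
    try:
        # Hand back a dict so the caller can read the timing afterwards
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f} Sec")


# Usage: anything inside the with-block is timed
with timed("Sleep") as t:
    time.sleep(0.05)
```

The elapsed time stays available in `t["seconds"]` after the block, which is handy when collecting several timings for a comparison.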
Load Dataset
The testing dataset was taken from the bblean repository. I included it here for this example.
The package comes with an easy way to create molecular fingerprints. Using their default method, we can pack the fingerprints, which stores every 8 bits of a fingerprint in a single byte. So a 2048-bit fingerprint is packed into 256 bytes. That can save a lot of memory for large molecular libraries.
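To make the packing idea concrete, here is a small pure-Python sketch. It is not bblean's implementation (which works on arrays and may order the bits differently); the helper names `pack_bits` and `unpack_bits` are made up for illustration, and the bit order within each byte is an assumption.

```python
def pack_bits(bits):
    """Pack a list of 0/1 ints into bytes, 8 bits per byte (most significant bit first)."""
    if len(bits) % 8:
        raise ValueError("bit count must be a multiple of 8")
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


def unpack_bits(packed):
    """Inverse of pack_bits: expand each byte back into 8 ints."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]


# A toy 2048-bit fingerprint with a handful of bits switched on
fingerprint = [0] * 2048
for on_bit in (0, 7, 8, 1023, 2047):
    fingerprint[on_bit] = 1

packed = pack_bits(fingerprint)
print(len(fingerprint), "bits ->", len(packed), "bytes")  # 2048 bits -> 256 bytes
```

Packing is lossless: `unpack_bits(packed)` returns the original list, so only the storage changes, not the fingerprint.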
To speed this notebook up, only the first 1,000 molecules are selected for clustering (during testing, I found that the whole set took too long for Butina).
# Create the fingerprints and pack them into a numpy array, starting from a *.smi file
smiles = bblean.load_smiles("datasets/18-clustering-with-bitbirch/chembl-33-natural-products-subset.smi")
# take 1000 molecules
smiles = smiles[:1000]
# calculate fingerprint
fps_bb = bblean.fps_from_smiles(smiles, pack=True, n_features=2048, kind="rdkit")
print(f"The number of molecules for clustering: {len(smiles)}")
The number of molecules for clustering: 1000
BitBIRCH Clustering
Keeping to their quickstart tutorial, I ran the BitBIRCH clustering with its default settings. Running this shows it takes roughly 0.02 seconds.
# record time
start = time.time()
# bitbirch clustering
tree = bblean.BitBirch()
tree.fit(fps_bb)
# end time
end = time.time()
total_time(end - start)
Total Time: 0.02 Sec
Butina Clustering
Unfortunately, during my tests, I could not pass the fingerprints calculated with bblean directly into RDKit, so the fingerprints had to be recalculated here. They are also RDKit fingerprints with a bit size of 2048. Remember, these fingerprints are not “packed” like the ones used in BitBIRCH above.
# list of RDKit molecule objects
mols = [Chem.MolFromSmiles(smi) for smi in smiles]
# rdkit fingerprint
fp_gen = rdFingerprintGenerator.GetRDKitFPGenerator(fpSize=2048)
rdkit_fps = [fp_gen.GetFingerprint(x) for x in mols]
# record time
start = time.time()
# butina clustering
clusters = cluster_fingerprints(rdkit_fps)
# end time
end = time.time()
total_time(end - start)
Total Time: 0.11 Sec
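For readers who have not met it before, the distance fed to Butina is 1 minus the Tanimoto similarity, stored as a flattened lower triangle. The sketch below mimics what the `tanimoto_matrix` helper does, but in pure Python on sets of on-bit indices instead of RDKit bit vectors. The names `tanimoto` and `condensed_distances` are illustrative, and treating two empty fingerprints as similarity 0.0 is a simplification chosen here, not necessarily what RDKit does.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Assumption for this sketch: two empty fingerprints get similarity 0.0
    return len(a & b) / union if union else 0.0


def condensed_distances(fps):
    """Flattened lower triangle of 1 - Tanimoto, row by row."""
    out = []
    for i in range(1, len(fps)):
        # Compare fingerprint i against every earlier fingerprint j < i
        out.extend(1.0 - tanimoto(fps[i], fps[j]) for j in range(i))
    return out


toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8}]
dists = condensed_distances(toy_fps)
print(dists)  # pair order: (1,0), (2,0), (2,1) -> [0.5, 1.0, 1.0]
```

The first two toy fingerprints share two of four distinct bits, so their distance is 0.5; the third shares nothing with either and sits at the maximum distance of 1.0.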
BitBIRCH Clustering - Unpacked
While BitBIRCH looks much faster than Butina clustering, remember that the BitBIRCH test used the “pack” parameter, which stores each 2048-bit fingerprint in 256 bytes. To see how packing affects BitBIRCH, I also ran the code on unpacked fingerprints.
# unpacked fingerprints
fps_bb_unpacked = bblean.fps_from_smiles(smiles, pack=False, n_features=2048, kind="rdkit")
# record time
start = time.time()
# bitbirch clustering
tree = bblean.BitBirch()
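# fit on the unpacked fingerprints (not fps_bb, which is the packed array);
# fit() may need to be told the input is unpacked, so check the bblean docs
tree.fit(fps_bb_unpacked)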
# end time
end = time.time()
total_time(end - start)
Total Time: 0.02 Sec
Conclusion
So the speeds for 1000 molecules are:
- BitBIRCH: 0.02 Seconds
- Butina: 0.11 Seconds
That is roughly a fivefold speed-up on just 1000 molecules, and the gap widens quickly on larger sets. Originally I was going to cluster the whole molecule set, which is 64,086 molecules. BitBIRCH clustered this set in a couple of seconds. Butina took… well, I lost patience after 30 minutes 🤣.
BitBIRCH looks like a very handy tool, especially for large molecular libraries. The package also comes equipped with fancy plot functions. Maybe I will explore those in the future.
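A rough way to see why Butina stalled on the full set: the `tanimoto_matrix` helper computes one distance per pair of molecules, so the work grows with the square of the library size. A quick back-of-the-envelope count (the function name `n_pairs` is just for illustration):

```python
def n_pairs(n):
    """Number of unordered pairs among n molecules, i.e. the lower-triangle size."""
    return n * (n - 1) // 2


for n in (1_000, 64_086):
    print(f"{n} molecules -> {n_pairs(n):,} pairwise distances")
# 1000 molecules -> 499,500 pairwise distances
# 64086 molecules -> 2,053,475,655 pairwise distances
```

Going from 1000 to 64,086 molecules multiplies the number of distances by roughly 4,100, and every one of them has to be computed and stored before Butina can start clustering.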